- Path: locutus.rchland.ibm.com!usenet
- From: Philip Staite <pstaite+@rchland.ibm.com>
- Newsgroups: comp.lang.c++
- Subject: Re: Special sort algorithm
- Date: Fri, 29 Mar 1996 09:21:22 -0600
- Organization: IBM Rochester, MN
- Message-ID: <315BFFF2.167E@rchland.ibm.com>
- References: <31593E09.29A0@kmd.dk>
- NNTP-Posting-Host: powertool.rchland.ibm.com
- Mime-Version: 1.0
- Content-Type: text/plain; charset=us-ascii
- Content-Transfer-Encoding: 7bit
- X-Mailer: Mozilla 2.01 (X11; I; AIX 1)
-
- Carsten Bonde wrote:
- >
- > Hi
- >
- > This is not strictly related to C++, but I guess some of you may have a hint
- > on this anyway.
- >
- > I am trying to make a DOS file sorting program, which is to be part of a commercial product.
- > It must :
- > - Sort big files of short lines
- > lines are uneven length
- > files are XX MB
- > lines are < 100 chars
- > - Put resulting file in original file
- > - Run in very little memory ( < 50 K)
- > - Use about no extra disk space ( < 10% extra)
- > - Run reasonably fast
- >
- > I have tried implementing in-file bubblesort and others with the language
- > being C++. But they are way too slow, even with some optimizing buffers.
- >
- > Does anyone have a suggestion for a good algorithm, or maybe even source.
-
- Try this approach. Although it won't be blindingly fast, it should get
- the job done... Note, if you could relax the disk space requirements to
- allow for 2x disk storage (i.e. 100% extra), then you could use a
- conventional external merge sort and it'd run pretty fast. Come on,
- disk space is cheap, an extra 50M of disk could improve speed by an
- order of magnitude here...
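-
- (For reference, here is a rough sketch of that external merge sort in
- C++. The run size, file names, and the use of std::sort and a
- priority_queue are illustrative choices, not part of the original
- suggestion; treat it as an outline rather than drop-in code.)
-
-   #include <algorithm>
-   #include <fstream>
-   #include <queue>
-   #include <string>
-   #include <vector>
-
-   // Pass 1: cut the input into sorted "runs" of roughly runBytes each.
-   std::vector<std::string> makeRuns(const char* inName, std::size_t runBytes)
-   {
-       std::ifstream in(inName);
-       std::vector<std::string> runNames, lines;
-       std::string line;
-       std::size_t used = 0;
-
-       auto flush = [&]() {
-           if (lines.empty()) return;
-           std::sort(lines.begin(), lines.end());
-           std::string name = "run" + std::to_string(runNames.size()) + ".tmp";
-           std::ofstream out(name);
-           for (const std::string& l : lines) out << l << '\n';
-           runNames.push_back(name);
-           lines.clear();
-           used = 0;
-       };
-
-       while (std::getline(in, line)) {
-           used += line.size() + 1;
-           lines.push_back(line);
-           if (used >= runBytes) flush();
-       }
-       flush();
-       return runNames;
-   }
-
-   // Pass 2: k-way merge of the sorted runs into the output file.
-   void mergeRuns(const std::vector<std::string>& runNames, const char* outName)
-   {
-       struct Head { std::string line; std::size_t which; };
-       auto gt = [](const Head& a, const Head& b) { return a.line > b.line; };
-       std::priority_queue<Head, std::vector<Head>, decltype(gt)> heap(gt);
-
-       std::vector<std::ifstream> runs;
-       for (const std::string& n : runNames) runs.emplace_back(n);
-
-       std::string line;
-       for (std::size_t i = 0; i < runs.size(); ++i)
-           if (std::getline(runs[i], line)) heap.push(Head{line, i});
-
-       std::ofstream out(outName);
-       while (!heap.empty()) {
-           Head h = heap.top();
-           heap.pop();
-           out << h.line << '\n';
-           if (std::getline(runs[h.which], line)) heap.push(Head{line, h.which});
-       }
-   }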
-
- Anyway, consider this algorithm:
-
- 1) Break up the initial file into < 20K sections. To do this, stat() or
- fstat() the file for total length. Subtract 20K from that, seek to
- that position, then move forward to the first newline. Remember this
- position. Copy the remainder of the file to a temp disk file. Use
- truncate() or ftruncate() to chop the original file's size (lopping off
- the chunk just saved to the temp file). Repeat this until the original
- file has been diced into N files of 20K or less. This shouldn't require
- more than 20K over the initial file size. (A rough sketch of this step
- appears after the step list.)
-
- 2) Block bubble sort. Read two of the temp files into a 40K buffer.
- Perform a quicksort on this buffer. (see qsort) There should be
- approximately 400 lines in the buffer.
-
- 3) Keep the first 200 lines in the buffer, write the "last" 200 (or
- fewer) back to the last file read.
-
- 4) Read the "next" file into the slots in the 40K buffer just flushed to
- the file.
-
- 5) Perform quicksort again.
-
- 6) Repeat steps 3, 4, and 5 until there is no "next" file.
-
- 7) Now, these first 200 lines are guaranteed to be the first 200 lines
- in the whole data set. Write these to the original file.
-
- 8) Copy the "last" temp file to the first, delete the last.
-
- 9) Repeat steps 2-6 to get the next 200 lines in the file.
-
- 10) Append these 200 lines to the original file.
-
- 11) Repeat steps 8-10 until there are no more temp files.
-
-
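- A rough sketch of step 1 in C++ (it uses std::filesystem::resize_file
- in place of truncate()/ftruncate(), and the chunk file names are made
- up for illustration):
-
-   #include <cstdint>
-   #include <filesystem>
-   #include <fstream>
-   #include <string>
-   #include <vector>
-
-   namespace fs = std::filesystem;
-
-   // Dice 'orig' into tail chunks of at most ~chunkBytes, cutting only
-   // at line boundaries. Each chunk is lopped off the *end* of the file,
-   // so at most one chunk of extra disk space is in use at any time.
-   std::vector<std::string> splitFile(const std::string& orig,
-                                      std::uintmax_t chunkBytes)
-   {
-       std::vector<std::string> chunks;
-       std::uintmax_t size = fs::file_size(orig);
-
-       while (size > chunkBytes) {
-           std::uintmax_t cut = size - chunkBytes;   // ~20K from the end
-
-           std::ifstream in(orig, std::ios::binary);
-           in.seekg(static_cast<std::streamoff>(cut));
-           std::string skipped;
-           std::getline(in, skipped);                // forward to next newline
-           std::uintmax_t start =
-               static_cast<std::uintmax_t>(in.tellg());
-
-           // Copy the tail [start, size) to a temp chunk file.
-           std::string name = "chunk" + std::to_string(chunks.size()) + ".tmp";
-           std::ofstream out(name, std::ios::binary);
-           out << in.rdbuf();
-           chunks.push_back(name);
-           in.close();
-
-           fs::resize_file(orig, start);             // chop the tail off
-           size = start;
-       }
-       chunks.push_back(orig);   // the <= 20K remainder is the last piece
-       return chunks;
-   }
-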
- What this is really doing is partitioning the original file into 20K
- segments, then performing a bubble sort over the segments, where two at
- a time are brought into memory to compare. (an external bubble sort,
- ugh...)
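-
- One pass of steps 2-7 might look something like this in C++ (std::sort
- stands in for qsort, lines are held as std::strings rather than a raw
- 40K buffer, and the file shuffling of steps 8-11 is left to the
- caller):
-
-   #include <algorithm>
-   #include <fstream>
-   #include <string>
-   #include <vector>
-
-   // After one pass, the globally smallest lines have been appended to
-   // outName, and chunks[1..] hold their "larger" leftovers for the
-   // next pass.
-   void onePass(const std::vector<std::string>& chunks,
-                const std::string& outName)
-   {
-       std::vector<std::string> held;                 // the lines we keep
-
-       {   // prime the buffer with the first chunk
-           std::ifstream in(chunks[0]);
-           std::string line;
-           while (std::getline(in, line)) held.push_back(line);
-       }
-
-       for (std::size_t i = 1; i < chunks.size(); ++i) {
-           std::vector<std::string> buf = held;       // steps 2 and 4
-           {
-               std::ifstream in(chunks[i]);
-               std::string line;
-               while (std::getline(in, line)) buf.push_back(line);
-           }
-
-           std::sort(buf.begin(), buf.end());         // step 5
-
-           // Step 3: keep the smaller part, write the larger part back
-           // to the chunk we just read.
-           std::size_t keep = std::min(held.size(), buf.size());
-           held.assign(buf.begin(), buf.begin() + keep);
-           std::ofstream out(chunks[i], std::ios::trunc);
-           for (std::size_t j = keep; j < buf.size(); ++j)
-               out << buf[j] << '\n';
-       }
-
-       // Step 7: 'held' is now guaranteed smallest overall; append it.
-       std::ofstream out(outName, std::ios::app);
-       for (const std::string& l : held) out << l << '\n';
-
-       // Step 8 (not shown): chunks[0] is now stale (its lines live in
-       // outName and the other chunk files), so copy the last chunk
-       // over it and delete the last chunk before the next pass.
-   }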
-
- Potential problems with this....
-
- When you run qsort it may pick a really bad pivot element. After the
- first pass, each "half" of the buffer is partially sorted. Therefore if
- qsort arbitrarily picks the middle element it'll actually get a pivot
- that is offset well into the sorted data. You may want to adjust your
- sizes and maintain a 15K "held" buffer and a 30K working buffer. This
- way you would only have to do an initial qsort on the unsorted parts
- (say, as you partitioned the original file). Thereafter, you would just
- perform an internal merge of the "held" 15K and the 15K coming in from
- a temp file.
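-
- That internal merge could look something like this (assuming the "held"
- lines and the incoming chunk are each already sorted, and again using
- std::strings rather than raw buffers):
-
-   #include <algorithm>
-   #include <iterator>
-   #include <string>
-   #include <vector>
-
-   // Merge the incoming chunk into the held lines, keep the smaller
-   // part, and hand the larger part back to be written to the chunk
-   // file. Linear time, no repeated qsort over mostly-sorted data.
-   void mergeStep(std::vector<std::string>& held,
-                  const std::vector<std::string>& incoming,
-                  std::vector<std::string>& writeBack)
-   {
-       std::vector<std::string> merged;
-       merged.reserve(held.size() + incoming.size());
-       std::merge(held.begin(), held.end(),
-                  incoming.begin(), incoming.end(),
-                  std::back_inserter(merged));
-
-       std::size_t keep = held.size();                // keep the same amount
-       writeBack.assign(merged.begin() + keep, merged.end());
-       merged.resize(keep);
-       held.swap(merged);
-   }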
-
- This may generate a LOT of files. If the original file is 50M and you
- partition it into 20K files, that is 2500 files. DOS is ill-equipped to
- deal with that many files. That is 80K of directory entries alone on
- disk (2500 entries at 32 bytes each). Note, DOS doesn't recover
- directory space when the files are deleted, so you'll want to make sure
- you do this in a temp directory, then remove the directory to recover
- that 80K...
-
-
- --
-
- Phil Staite, (507) 253-2529, team OS/2
- internet: pstaite@vnet.ibm.com internal: pstaite@rchland
-